Learning in Markov Games with Incomplete Information

Author

  • Junling Hu
Abstract

The Markov game (also called stochastic game (Filar & Vrieze 1997)) has been adopted as a theoretical framework for multiagent reinforcement learning (Littman 1994). In a Markov game there are n agents, each facing a Markov decision process (MDP). The agents' MDPs are correlated through their reward functions and the state transition function. Just as the Markov decision process provides a theoretical framework for single-agent reinforcement learning, Markov games provide such a framework for multiagent reinforcement learning.

In my thesis, I extend Littman's framework of 2-player zero-sum Markov games to 2-player general-sum Markov games. In a zero-sum game, the two players' rewards always sum to zero in every situation: one agent's gain is always the other agent's loss, so the agents have strictly opposite interests. In a general-sum game, the agents' rewards may sum to any number, and the agents may have an incentive to cooperate if they both receive positive rewards in certain situations. General-sum games thus include zero-sum games as a special case. The solution concept for general-sum games is the Nash equilibrium, which requires every agent's strategy to be a best response to the other agents' strategies. In a Nash equilibrium, no agent can gain by unilateral deviation.

In Markov games with incomplete information, agents cannot observe the payoff functions of the other agents. They therefore need to form beliefs about the other agents by learning during their interactions, and this learning must be online in dynamic systems (Hu & Wellman 1998). I aim to design an online Q-learning algorithm for Markov games and prove that it converges to a Nash equilibrium. Before designing the algorithm, I defined the equilibrium concept for Markov games under incomplete information. The definition has the following requirements: (1) each agent's belief must be consistent with the actual outcome; (2) each agent's strategy must be a best response to its belief about the other agent.
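The no-unilateral-deviation condition can be checked directly on a small example. The sketch below uses a hypothetical prisoner's-dilemma-style general-sum game (the payoff numbers are illustrative, not from the thesis) and enumerates pure-strategy profiles, keeping those where neither player can improve by deviating alone:

```python
def is_nash(A, B, i, j):
    # Pure-strategy Nash check: (i, j) is an equilibrium iff neither
    # player can gain by unilaterally switching to another action.
    row_best = all(A[i][j] >= A[k][j] for k in range(len(A)))
    col_best = all(B[i][j] >= B[i][k] for k in range(len(B[0])))
    return row_best and col_best

# Hypothetical general-sum payoffs (prisoner's dilemma style):
# action 0 = cooperate, action 1 = defect.
A = [[3, 0], [5, 1]]   # row player's rewards
B = [[3, 5], [0, 1]]   # column player's rewards

equilibria = [(i, j) for i in range(2) for j in range(2) if is_nash(A, B, i, j)]
# Mutual defection (1, 1) is the unique pure-strategy Nash equilibrium here,
# even though mutual cooperation (0, 0) would pay both players more.
```

Note that general-sum games may have only mixed-strategy equilibria; this pure-strategy enumeration is the simplest case that illustrates the definition.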
When the system reaches equilibrium, there will be no more changes in the agents' strategies or their beliefs.
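The abstract does not spell out the Q-learning update itself. As a point of reference, the zero-sum special case that Littman's framework covers (minimax-Q) can be sketched as a tabular update in which the stage-game value replaces the usual max over own actions. This is a simplified illustration, not the thesis algorithm: the state and action labels are made up, and mixed strategies are omitted (full minimax-Q solves a small linear program at each state):

```python
def minimax_value(q_s):
    # Stage-game value for the learner in state s, restricted to pure
    # strategies for brevity (minimax-Q proper optimizes over mixed ones).
    return max(min(row) for row in q_s)

def minimax_q_update(Q, s, a, o, r, s_next, alpha=0.1, gamma=0.9):
    # Q[s][a][o]: learner's value of taking action a while the opponent
    # takes action o in state s. Standard temporal-difference update with
    # the minimax stage-game value as the backup target.
    Q[s][a][o] = (1 - alpha) * Q[s][a][o] \
        + alpha * (r + gamma * minimax_value(Q[s_next]))

# Tiny two-state example with 2 actions per player (illustrative numbers).
Q = {0: [[0.0, 0.0], [0.0, 0.0]],
     1: [[1.0, 2.0], [3.0, 0.0]]}
minimax_q_update(Q, s=0, a=0, o=0, r=1.0, s_next=1)
```

The general-sum extension the thesis pursues replaces the minimax stage value with a Nash equilibrium value of the stage game defined by the agents' Q-tables.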


Similar articles

Utilizing Generalized Learning Automata for Finding Optimal Policies in MMDPs

Multiagent Markov decision processes (MMDPs), the generalization of Markov decision processes to the multiagent case, have long been used for modeling multiagent systems and serve as a suitable framework for multiagent reinforcement learning. In this paper, a generalized learning automata based algorithm for finding optimal policies in MMDPs is proposed. In the proposed algorithm, MMDP ...


Markov Games of Incomplete Information for Multi-Agent Reinforcement Learning

Partially observable stochastic games (POSGs) are an attractive model for many multi-agent domains, but are computationally extremely difficult to solve. We present a new model, Markov games of incomplete information (MGII), which imposes a mild restriction on POSGs while overcoming their primary computational bottleneck. Finally we show how to convert an MGII into a continuous but bounded fully ...


Multiagent Reinforcement Learning in Stochastic Games

We adopt stochastic games as a general framework for dynamic noncooperative systems. This framework provides a way of describing the dynamic interactions of agents in terms of individuals' Markov decision processes. By studying this framework, we go beyond the common practice in the study of learning in games, which primarily focuses on repeated games or extensive-form games. For stochastic games...


Markov Games with Frequent Actions and Incomplete Information - The Limit Case

We study a two-player, zero-sum, stochastic game with incomplete information on one side in which the players are allowed to play more and more frequently. The informed player observes the realization of a Markov chain on which the payoffs depend, while the non-informed player only observes his opponent’s actions. We show the existence of a limit value as the time span between two consecutive s...


Markov games with frequent actions and incomplete information

We study a two-player, zero-sum, stochastic game with incomplete information on one side in which the players are allowed to play more and more frequently. The informed player observes the realization of a Markov chain on which the payoffs depend, while the non-informed player only observes his opponent’s actions. We show the existence of a limit value as the time span between two consecutive s...




Publication date: 1998